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ABSTRACT 

Bitmap indexes must be compressed to reduce input/output costs 
and minimize CPU usage. To accelerate logical operations (AND, 
OR, XOR) over bitmaps, we use techniques based on run-length 
encoding (RLE), such as Word-Aligned Hybrid (WAH) compres- 
sion. These techniques are sensitive to the order of the rows: a sim- 
ple lexicographical sort can divide the index size by 9 and make 
indexes several times faster. We investigate reordering heuristics 
based on computed attribute- value histograms. Simply permuting 
the columns of the table based on these histograms can increase the 
sorting efficiency by 40%. 

Categories and Subject Descriptors 

H. 3.2 [Information Storage and Retrieval]: Information Storage; 
E.l [Data]: Data Structures 

General Terms 

Algorithms, Performance, Experimentation. 

I. INTRODUCTION 

Bitmap indexes are among the most commonly used indexes in 
data warehouses [3, 8]. Without compression, bitmap indexes can 
be impractically large and slow. Word-Aligned Hybrid (WAH) [25] 
is a competitive compression technique: compared to LZ77 [5] and 
Byte-Aligned Bitmap Compression (BBC) [1], WAH indexes can 
be ten times faster [24]. 

Run-length encoding (RLE) and similar encoding schemes (BBC 
and WAH) make it possible to compute logical operations between 
bitmaps in time proportional to the compressed size of the bitmaps. 
However, their efficiency depends on the order of the rows. While 
computing the best ordering is NP-hard [2], simple heuristics such 
as lexicographical sort are effective. 

Pinar et al. [14], Sharma and Goyal [18], and Canahuate et al. [4] 
used Gray-code row sorting to improve RLE and WAH compres- 
sion. However, their largest bitmap index could fit uncompressed 
in RAM on a PC. 

We distinguish two types of heuristics for this problem. Heuris- 
tics such as lexicographical sort [2] or Gray-code sorting [14] are 
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histogram-oblivious. They ignore the number of attribute values 
and their frequencies. Other heuristics are histogram-aware. They 
include column reorganizations and frequency-aware ordering. On 
larger data sets [2], we had considered histogram-oblivious row- 
ordering heuristics. Sorting before indexing reduced the total con- 
struction time. Our main contribution is an evaluation of practi- 
cal histogram-aware heuristics to the row ordering problem. Sec- 
ondary contributions include guidelines about when "unary" bitmap 
encoding is preferred, and an improvement over the naive bitmap 
construction algorithm — it is now practical to construct bitmap in- 
dexes over tables with hundreds of millions of rows and millions of 
attribute values. 

To further reduce the size of bitmap indexes, we can bin the at- 
tribute values [7, 12, 17, 20]. For range queries, different bitmap 
encodings have different space-performance tradeoffs [5,6]. 

2. BITMAP INDEXES 

We find bitmap indexes in several database systems, apparently 
beginning with the MODEL 204 engine, commercialized for the 
IBM 370 in 1972. 

The simplest and most common method of bitmap indexing as- 
sociates a bitmap with every attribute value v of every attribute 
a; the bitmap represents the predicate a = v. For a table with 
n rows (facts) and c columns (attributes/dimensions), each bitmap 
has length n. Initially, all bitmap values are set to 0. For row j, we 
set the j th component of c bitmaps to 1 . If the i' th attribute has 
possible values, we have L = £? =1 rc; bitmaps. 

Bitmap indexes are fast, because we find rows having a given 
value v for attribute a by reading only the bitmap corresponding to 
value v (and not the other bitmaps for attribute a), and there is only 
one bit (or less, with compression) to process for each row. More 
complex queries are achieved with logical operations (AND, OR, 
XOR, NOT) over bitmaps and current microprocessor can perform 
32 or 64 bitwise operations in a single machine instruction. 

For row j, exactly one bitmap per column will have its f h entry 
set to 1 . Although the entire index has nL bits, there are only nc 1 's; 
for many tables, L^> c and thus on average the table is very sparse. 
Long (hence compressible) runs of 0's are expected. 

One can also reduce the number of bitmaps for large dimensions. 
Given L bitmaps, there are L(L — 1 ) /2 pairs of bitmaps. So, instead 
of mapping an attribute value to a single bitmap, we map them to 
pairs of bitmaps (see Table 1). We refer to this technique as 2-of- 
N encoding [23]; with it, we can use far fewer bitmaps for large 
dimensions. For instance, with only 2,000 bitmaps, we can rep- 
resent an attribute with 2 million distinct values. But the average 
bitmap density is much higher with 2-of-/V encoding, and thus com- 
pression may be less effective. More generally, k-of-N encoding 



Table 1: Example of 1-of-N and 2-of-N encoding 

Montreal 100000000000000 110000 
Paris 010000000000000 101000 

Toronto 001000000000000 100100 
New York 000100000000000 011000 
Berlin 000010000000000 010100 

allows L bitmaps to represent ft) distinct values; conversely, us- 
ing L = \kn; ~\ bitmaps is sufficient to represent rij distinct values. 
However, searching for a specified value v no longer requires scan- 
ning a single bitmap. Instead, the corresponding k bitmaps must be 
combined with a bitwise AND. There is a tradeoff between index 
size and the index speed [2], 

For small dimensions, using k-of-N encoding may fail to reduce 
the number of bitmaps, but still reduce the performance. We apply 
the following heuristic. Any column with less than 5 distinct values 
is limited to l-of-N encoding (simple or unary bitmap). Any col- 
umn with less than 2 1 distinct values, is limited to k = 1,2, and any 
column with less than 85 distinct values is limited to k = 1,2,3. 

3. COMPRESSION 

RLE compresses efficiently when there are long runs of iden- 
tical values: it works by replacing any repetition by the number 
of repetitions followed by the value being repeated. For example, 
the sequence 11110000 becomes 4140. Current microprocessors 
perform operations over words of 32 or 64 bits and not individual 
bits. Hence, the CPU cost of RLE might be large [19]. By trad- 
ing some compression for more speed, Antoshenkov [1] defined a 
RLE variant working over bytes instead of bits: the Byte- Aligned 
Bitmap Compression (BBC). Trading even more compression for 
even more speed, Wu et al. [25] proposed the Word- Aligned Hybrid 
(WAH). Their scheme is made of two different types of words 1 . 
The first bit of every word distinguishes a verbatim (or dirty ) 31- 
bit word from a running sequence of 31 -bit clean words (0x00 or 
lxll). Running sequences are stored using 1 bit to distinguish be- 
tween the type of word (0 for 0x00 and 1 for lxll) and 30 bits to 
represent the number of consecutive clean words. Hence, a bitmap 
of length 62 containing a single 1-bit at position 32 would be coded 
as the words 100x01 and 010x00. Because dirty words are stored 
in units of 31 bits using 32 bits, WAH compression can expand the 
data by 3%. We created our own WAH variant called Enhanced 
Word-Aligned Hybrid (EWAH). Contrary to WAH compression, 
EWAH may never (within 0.1%) generate a compressed bitmap 
larger than the uncompressed bitmap. It also uses only two types 
of words (see Fig. 1). The first type is a 32-bit verbatim word. The 
second type of word is a marker word: the first bit is used to indi- 
cate which clean word will follow, 16 bits to store the number of 
clean words, and 15 bits to store the number of dirty words follow- 
ing the clean words. EWAH bitmaps begin with a marker word. 

Given L bitmaps and n rows, we can naively construct a bitmap 
index in time OinV) by appending a word to each compressed 
bitmap every 32 or 64 rows. We found this approach impractically 
slow when L was large — typically, with k = 1. Instead, we con- 
struct bitmap indexes in time Oinck + L) = 0(nck) [2] where ck 
is the number of true values per row (See Algorithm 1): within 
each block of 32 rows, we store the values of the bitmaps in a 
set — omitting any unsolicited bitmap, whose values are all false 
(0x00). We partition the table horizontally into blocks indexed with 
compressed bitmaps using a fixed memory budget (256 MiB). Each 
block of bitmaps is written sequentially and preceded by an array 
of 4-byte integers containing the location of each bitmap. 

'For simplicity, we limit our exposition to 32 bit words. 



a) an example bitmap being compressed (5456 bits) 

loooooaooi 1 1 oaaooi i aaooi 1 1 aoooi i 32 bits 

0OO0Q000OG000....00OG0000QO0Q000O0 5392 bits 
0O1 1 1 1 11 1 1 1 1 0000000000000 1 1 1 000 1 3 2 bits 
b} dividing thebirnun into 32-bit gfoups 
1 00000000 1 1 1 0O0D0 11 0000 1 1 1 0000 11 g reu p 1 : 32 bits 
OOOOOQ«MOQao....n J onoon(Mjnrj(iaQOODO g rou p 2-1 75: 1 74*32 bits 
001 1 1 11 11 11 1 OOQ000OO0QQ00 11 1 0001 g rou p 1 76: 32 bits 
c) EWAH encoding 
OOOOOOOOOOOOOOODi'000000000000001 marker-word 
1O0000OO01 1 1QO0001 100001 1 100001 1 dirty woicl 
pOOaniQIOIOQfl lffl- 11 " 000000000000001 msrker-word 



dirty word 



001 1 1 1 1 1 1 1 1 100000000000001 1 1 0001 

I— number of clean words: 10 bits 
type of the dean words: 1 bit 
number nf dirty words following clean words: 15 bits 



Figure 1: Enhanced Word- Aligned Hybrid (EWAH) 

Algorithm 1 Constructing bitmaps. For simplicity, we assume the 
number of rows is multiple of the word size. 
Construct: B[,... ,Bl, L compressed bitmaps 
to, «- for 1 < i < L. 
c <— 1 {row counter} 
5\£ < — { fA^ records the dirtied bitmaps ) 
for each table row do 

for each attribute in the row do 

for each bitmap i corresponding to the attribute value do 
set to true the (c mod w) th bit of word CO, 
#<-#U{i} 
if c is a multiple of w then 
for i in 5\£ do 

add c/w — \Bj \ — 1 clean words (0x00) to B, 
add the word CO, to bitmap B, 
CO, <- 

<v:<-o 

c <— c + 1 
for /m {1,2,..., L} do 

add c/w — \Bj\ — 1 clean words (0x00) to B, 

Naively, we could compute logical operations between 2 bitmaps 
in n/32 bitwise operations. Instead, we compute logical operations 
(OR, AND, XOR) between 2 bitmaps in time 0(|5i | + \B 2 \) where 
\Bj\ is the size of the compressed bitmap [2,25]. Finally, we can 
bound the bitmap sizes: | A/^/l < mm i \Bi\ and | Vi^il < Ei \Bi\- 

4. SORTING TO IMPROVE COMPRESSION 

Sorting can benefit bitmap indexes at several levels. We can sort 
the rows of the table. The sorting order depends itself on the order 
of the table columns. And finally, we can allocate the bitmaps to 
the attribute values in sorted order. 

4.1 Sorting rows 

Reordering the rows of a compressed bitmap index can improve 
compression. Whether using RLE, BBC, WAH or EWAH, the 
problem is NP-hard by reduction from the Hamiltonian path prob- 
lem [2, Theorems 1 and 2]. A simple heuristic begins with an un- 
compressed index. Rows (binary vectors) are then rearranged to 
promote runs. In the process, we may also reorder the bitmaps. 
This is the approach of Canahuate et al. [4], but it uses Q(nL) time. 
For the large dimensions and number of rows we have considered, 
it is infeasible. A more practical approach [2] is to reorder the table, 
then construct the compressed index directly; we can also reorder 
the table columns prior to sorting. 

Three types of ordering can be used for ordering rows. We may 
cluster identical rows, but it is not a competitive heuristic [2], 

• In lexicographic order, a sequence a\ ,a 2 , ■ ■ ■ is smaller than 
another sequence b\ ,b%, . . . if and only if there is a j such that 
cij < bj and a/ = bj for i < j. The Unix sort command pro- 
vides an efficient mean of sorting flat files into lexicographic 



order; in under 10 s our test computer (see Section 6) sorted a 
5-million-line, 120 MB file. SQL supports lexicographic sort 
via ORDER BY. 

• Gray-code (GC) sorting is defined over bit vectors [14]: the 
sequence a 1,02,... is smaller than b\,b2,--- if and only if 
there exists j such that 2 aj = a.\ © 02 ffi . . . ffi «/_ 1 , bj ^ aj, 
and a; = b/ for i < j. Algorithm 2 shows how to compare 
sparse GC bit vectors vi and V2 in time 0(min(|vi |, | V2 1 ) 
where |v, | is the number of true value in bit vector v,-. Sort- 
ing the rows of a bitmap index without materializing the un- 
compressed bitmap index is possible [2]: we implemented 
an 0(nck\ogn)-time solution for k-of-N indexes using an 
external-memory B-tree [10]. Unfortunately, it proved to be 
two orders of magnitude slower than lexicographic sort. 



Algorithm 2 Gray-code less comparator between sparse bit vectors 
INPUT: arrays a and b representing the position of the ones in 
two bit vectors 

OUTPUT: whether the bit vector represented by a is less than 
the one represented by b 

f *— true 

m «— min(length(a),length(i)) 
for p in 1 , 2, . . . , m do 

return / if a p > b p and -1/ if a p < b p 

return -if if length(a) > lengfh(fo), / if length(b) > lengfh(a), 
and false otherwise 



For RLE, the best ordering of the rows of a bitmap index min- 
imizes the sum of the Hamming distances: Y,iMri, r i+\) where r,- 
is the i row, for h{x,y) = \{i\xi ^ y;}|. If all 2 L different rows 
are present, the GC sort would be an optimal solution to this prob- 
lem [14]. The following proposition shows that GC sort is also 
optimal if all PT\ k-of-N codes are present. The same is not true of 
lexicographic order when k > 1: 0110 immediately follows 1001 
among 2-of-4 codes, but their Hamming distance is 4. 

Proposition 1 We can enumerate, in GC order, all k-of-N codes in 
time 0(kCl)) (optimal complexity). Moreover, the Hamming dis- 
tance between successive codes is minimal ( =2). 

PROOF. Let a be an array of size k indicating the positions of 
the ones in k-of-N codes. As the external loop, vary the value a\ 
from 1 to N — k + 1. Within this loop, vary the value 02 from N — 
k + 2 down to a\ + 1, Inside this second loop, vary the value of 
03 from «2 + 1 up to N — k + 3, and so on. By inspection, we see 
that all possible codes are generated in increasing GC order. To see 
that the Hamming distance between successive codes is 2, consider 
what happens when a, completes a loop. Suppose that i is odd and 
greater than 1, then a, had value N — k + i and it will take value 
a,_i + 1. Meanwhile, by construction, a; + \ (if it exists) remains at 
value N — k + i+1 whereas a, + 2 remains at value N — k + i + 2 and 
so on. The argument is similar if i is even. □ 

For a given column, suppose that in a block of 32 rows, we have 
j distinct attribute values. We computed the average number of 
bitmaps that would have a dirty word (see Fig. 2). Comparing k- 
of-N codes that were adjacent in GC ordering against k-of-N codes 
that were lexicographically adjacent, the difference was insignifi- 
cant for k — 2. However, GC ordering is substantially better for 

2 The symbol © is the XOR operator. 
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Figure 2: Probabilities that a bitmap will contain a dirty word, 
when several (x-axis) of 1000 possible attribute values are found 
in a 32-row chunk. Effects are shown for values with k-of-N 
codes that are adjacent in GC order, adjacent in lexicographic 
order, or randomly selected. 

k > 2, where bitmaps are denser. Selecting the codes randomly 
is disastrous. Hence, sorting part of a column — even one without 
long runs of identical values — improves compression for k > 1 . 

For encodings like BBC, WAH or EWAH, GC sorting is not opti- 
mal, even when all k-of-N codes are present. For example consider 
the sequence of rows 1001, 1010, 1100, 0101, 0101, 0110, 0110, 
0011. Using 4-bit words, we see that a single bitmap contains a 
clean word (0000) whereas by exchanging the fifth and second row, 
we get two clean words (0000 and 1111). 

4.2 Sorting bitmap codes 

For a simple index, the map from attribute value to bitmaps is 
inconsequential; for k-of-N encodings, some bitmap allocations are 
more compressible: consider an attribute with two overwhelmingly 
frequent values and many other values that occur once each. If 
the table rows are given in random order, the two frequent values 
should have codes that differ as little as possible. 

There are several ways to allocate the bitmaps. Firstly, the at- 
tribute values can be visited in alphabetical or numerical order, or — 
for histogram-aware schemes — in order of frequency. Secondly, 
the bitmap codes can be used in different orders. We consider lex- 
icographical ordering (1 100, 1010, 1001, 01 10, ... ) and GC order 
( 1 00 1 , 1 1 0, 1 1 00, 1 1 , ... ) ordering (see proof of Proposition 1 ) . 
For dense low-dimensional tables, GC order is preferable [2] and 
its compression effects are comparable to sorting the index rows in 
GC order. Meanwhile, it is technically easier to implement since 
we can sort the table lexicographically and only use GC ordering 
during the bitmap index construction. 

Alpha-Lex denotes sorting the table lexicographically and as- 
signing bitmap codes so that the fi 1 attribute gets the lexicographi- 
cally ; th smallest bitmap code. Gray-Lex is similar, except that the 
! th attribute gets the rank-i bitmap code in GC order. These two ap- 
proaches are histogram oblivious — they ignore the frequencies of 
attribute values. 

Knowing the frequency of each attribute value can improve code 
assignment when k > 1. For instance, clustering dirty words in- 
creases the compressibility. Within a column, Alpha-Lex and Gray- 
Lex order runs of identical values irrespective of the frequency: the 
sequence af cccadeaceabe may become aaaabccccdeeef . For 
better compression, we should order the attribute values — within 
a column — by their frequency (e.g., aaaacccceeebdf ). Allocat- 
ing the bitmap codes in GC order to the frequency-sorted attribute 
values, our Gray-Freq uency sorts the table rows as follows. Let 
/(a;) be the frequency of attribute a,-. Instead of sorting the ta- 
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Figure 3: Storage gain in words for sorting a given column 
with 100,000 rows and various number of attribute values 

(25(foi, \kn/ k ],n)-4m). 

ble rows a\ ,«2, ■ • ■ , a d> we lexicographically sort the extended rows 
f(a\),a\,f(a2),a2, ■ ■ ■ ,f(ftd)i a d by comparing the frequencies by 
their numerical value. The frequencies /(a/) are discarded prior to 
indexing. 

4.3 Choosing the column order 

Lexicographic table sorting uses the ; th column as the i™ sort 
key: it uses the first column as the main key, the second column 
to break ties when two rows have the same first component, and so 
on. Some column orderings lead to smaller indexes than others [2]. 

We model the storage cost of a bitmap index as the sum of the 
number of dirty words and the number of sequences of identical 
clean words (lxl 1 or 0x00). If a set of L bitmaps has x dirty words, 
then there are at most L + x sequences of clean words; the stor- 
age cost is at most 2x + L. This bound will be tighter for sparser 
bitmaps. Because the simple index of a column has at most n 1-bits, 
it has at most n dirty words, and thus, the storage cost is at most 3n. 
The next proposition shows that the storage cost of a sorted column 
is bounded by 5«,-. 

Proposition 2 Using GC-sorted k-of-L codes, a sorted column with 
tij distinct values has no more than 2ni dirty words, and the storage 
cost is no more than Ani + \kn^ k ] . 

For k = 1 , Proposition 2 is true irrespective of the order of the 
values, as long as identical values appear sequentially. Another ex- 
treme is to assume that all 1-bits are randomly distributed. Then 
sparse bitmap indexes have as 8(r,L,n) = (1 — (1 — jz) w )^B dirty 
words where r is the number of 1-bits, L is the number of bitmaps 
and w is the word length (w = 32). Hence, we have an approxi- 
mate storage cost of 28+ [faij ]. The gain of column C is the 
difference between the expected storage cost of a randomly row- 
shuffled C, minus the storage cost of a sorted C. We estimate the 

gain by 28(kn, \knj k \ , n) — An, (see Fig. 3) for columns with uni- 
form histograms. The gain is modal: it increases until a maximum 
is reached and then it decreases. The maximum gain is reached at 
« (b(w- l)/2)* /( * +1) : for n = 100,000 and w = 32, the maximum 
is reached at ~ 1 , 200 for k = 1 and at ~ 13, 400 for k = 2. Skewed 
histograms have a lesser gain for a fixed cardinality n, . 

After lexicographic sorting, the i th column is divided into at most 
■ • • sorted blocks. Hence, it has at most 2n\ •■■Hi dirty 
words. When the distributions are skewed, the z th column will have 
blocks of different lengths and their ordering depends on how the 
columns are ordered. To assess these effects, we generated data 
with 4 independent columns: using uniformly distributed dimen- 
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(b) Zipfian data with skew parameters 1.6, 1.2, 0.8 and 
0.4 



Figure 4: Index sizes in words for various dimension orders on 
synthetic data (100,000 rows). Zipfian columns have 100 dis- 
tinct values. Ordering "1234" indicates ordering by descending 
skew (Zipfian) or ascending cardinality (uniform). 

sions of different sizes (see Fig. 4(a)) and using same-size dimen- 
sions of different skew (see Fig. 4(b)). We then determined the 
Gray-Lex index size for each of the 4! different dimension order- 
ings. Based on these results, for sparse indexes (k = 1), dimensions 
should be ordered from least to most skewed, and from smallest to 
largest; whereas the opposite is true for k > 1. 

A sensible heuristic might be to sort columns by increasing den- 

— l/k " I Ik 

sity (~ n- t ). However, a very sparse column (n- 2> w) will 

not benefit from sorting (see Fig. 3) and should be put last. Hence, 

we use the following heuristic: columns are sorted in decreasing 



order with respect to min(n ; 1 , (1 — n i )/ (Aw — 1)): this func- 
tion is maximum at density n i l ' k = l/(4w) and it goes down to 
zero as the density goes to 1. In Fig. 4(a), this heuristic makes the 
best choice for all values of k. We consider this heuristic further in 
Section 6.3. 



4.4 Avoiding column order 

As an alternative to lexicographic sort and column reordering, 
we introduce Frequent-Component sorting, which uses histograms 
to help sort without bias from a fixed dimension ordering. In sort- 
ing, we compare the frequency of the ft 1 most frequent attribute 
values in each of two rows without regard (except for possible tie- 
breaking) to which columns they come from. With appropriate pre- 
and post-processing, it is possible to implement this approach using 
a standard sorting utility such as Unix sort. 



5. PICKING THE RIGHT K-OF-N 

Choosing k and N are important decisions. We choose a single 
k value for all dimensions 3 , leaving the possibility of varying k 
by dimension as future work. Larger values of k typically lead to 
a smaller index and a faster construction time — although we have 
observed cases where k = 2 makes a larger index. However, query 
times increase with k: there is a construction time/speed tradeoff. 

Larger k makes queries slower. 

We can bound the additional cost of queries. Assume ft) = n,. 
A given fc-of-L, bitmap is the result of an OR operation over at most 

krii/Li < 3n-* unary bitmaps. Because | V/^il — E;l-^il> me 

expected size of such a bitmap is no larger than 3nJ times the 
expected size of a unary bitmap. A query looking for one attribute 
value will have to AND together k of these denser bitmaps. The 
entire ANDing operation can be done by k — 1 pairwise ANDs that 
produce intermediate results whose EWAH sizes are increasingly 
small: 2k— 1 bitmaps are thus processed. Hence, the expected time 
complexity of an equality query on a dimension of size is no 

more than 3 (2k — 1 )n ; * times higher than the expected cost of the 
same query on a k = 1 index. 

For a less pessimistic estimate of this dependence, observe that 
indexes seldom increase in size when k grows. We may conserva- 
tively assume that index size is unchanged when k changes. There- 
fore the expected size of one bitmap grows as ~ n j 1 /k, leading 

to queries whose cost is proportional to (2— l/k)n i 1 ^ k . Relative 
to the cost for k = 1, which is proportional to 1/n,-, we can say that 

(k— l)/k 

increasing k leads to queries that are(2 — 1 jk)n\ times more 
expensive than on a simple bitmap index. 

For example, suppose n,- = 100, going from k = 1 to k = 2 should 
increase query cost about 15 fold but no more than 90 fold. In sum- 
mary, the move from k = 1 to anything larger can have a dramatic 
negative effect on query speeds. Once we are at k = 2, the incre- 
mental cost of going to k = 3, k = 4 is not so high: whereas the 

ratio k = 2/k— 1 goes as ^/nj, the ratio k = 3/k — 2 goes as nj b . 

Larger k makes indexes smaller. 

Consider the effect of a length 100 run of values v\ , followed by 
100 repetitions of V2, then 100 of V3, etc. Regardless of k, whenever 
we switch from v\ to v i+l at least two bitmaps will have to make 
transitions between and 1. Thus, unless the transition appears 
at a word boundary, we create at least 2 dirty words whenever an 
attribute changes from row to row. The best case, where only 2 
dirty words are created, is achieved when k = 1 for any assignment 
of bitmap codes to attribute values. For k > 1 and N as small as 
possible, it may not be possible to achieve so few dirty words, or it 
may require a particular assignment of bitmap codes to values. 

Encodings with k > 1 find their use when many (e.g. 15) attribute 
values fall within a word-length boundary. In that case, a k = 1 
index will have at least 15 bitmaps with transitions (and we can 
anticipate 15 dirty words). However, if there were only 45 possible 
values in the dimension, we would not need more than 10 bitmaps 
with k = 2. Hence, there would be at most 10 dirty words and 
maybe less if we have sorted the data (see Fig. 2). 

Choosing N. 

It seems intuitive, having chosen k, to choose A' to be as small as 
possible. Yet, we have observed cases where the resulting 2-oi-N 

3 Except that for columns with small rc;, we automatically adjust k 
downward when it exceeds the limits noted at the end of Section 2. 



Table 2: Characteristics of data sets used. 







rows 


cols 


V m. 

Li«! 


size 


Census-Income 




199 


523 


42 


103 419 


99.1MB 


4-d projection 




199 


523 


4 


102 609 


2.96 MB 


DBGEN 


13 


977 


980 


16 


4 411 936 


1.5 GB 


4-d projection 


13 


977 


980 


4 


402 544 


297 MB 


Netflix 


100 


480 


507 


4 


500 146 


2.61GB 


KJV-4grams 


877 


020 


839 


4 


33 553 


21.6GB 



indexes are much bigger than 1-of-iV indexes. Theoretically, this 
could be avoided if we allowed larger A', because one could aways 
append an additional 1 to every attribute's 1-of-JV code. Since this 
would create one more (clean) bitmap than the l-of-N index has, 
this 2-of-N index would never be much larger than the 1-of-iV in- 
dex. So, if N is unconstrained, we can see that there is never a 
significant space advantage to choosing k small. 

Nevertheless, the main advantage of k > 1 is fewer bitmaps. We 
choose N as small as possible. 

6. EXPERIMENTAL RESULTS 

We present experiments to assess the effects of various factors 
(choices of k, sorting approaches, dimension orderings) in terms of 
EWAH index sizes. These factors also affect index creation and 
query times (we report real wall-clock times). 

6.1 Platform 

Our test programs 4 were written in C++ and compiled by GNU 
GCC 4.0.2 on an Apple Mac Pro with two double-core Intel Xeon 
processors (2.66 GHz) and 2GiB of RAM. Lexicographic sorts 
of flat files were done using GNU coreutils sort version 6.9. For 
all tests involving k = 1, we used the sparse implementation ap- 
proached in Section 3 because without it, the Gray-Lex index cre- 
ation times were 20-100 times slower, depending on the data set. 

6.2 Data sets used 

We primarily used four data sets, whose details are summarized 
in Table 2: Census-Income [9], DBGEN [21], KJV-4grams, and 
Netflix [13]. DBGEN is a synthetic data set, whereas KJV-4grams 
is a large list (including duplicates) of 4-tuples of words obtained 
from the verses in the King James Bible [16], after stemming with 
the Porter algorithm [15] and removal of stemmed words with three 
or fewer letters. Occurrence of row w\ , wi , W3 , W4 indicates that the 
first paragraph of a verse contains words w\ through W4, in this or- 
der. This data is a scaled-up version of word co-occurrence cubes 
used to study analogies in natural language [1 1, 22]. Each of KJV- 
4grams' columns contains roughly 8 thousand distinct stemmed 
words. The Netflix table has 4 dimensions: UserlD, MovielD, Date 
and Rating, having cardinalities 5, 2 182, 17 770, and 480 189. De- 
tails of how it was obtained from the data downloaded are given 
elsewhere [2]. 

For some of our tests, we chose four dimensions with a wide 
range of sizes. For Census-Income, we chose age (di), wage per 
hour ((I2), dividends from stocks (d^) and a numerical value 5 found 
in the 25 th position (^4). Their respective cardinalities were 91, 
1 240, 1 478 and 99 800. For DBGEN, we selected dimensions of 
cardinality 7, 11,2 526 and 400 000. Dimensions are numbered by 
increasing size: column 1 has fewer distinct values. 



4 http : //code . google . com/p/lemurbitmapindex/. 

5 The associated metadata says this column should be a 10-valued 

migration code. 



6.3 Column Ordering 

Fig. 5 shows the Gray-Lex index sizes for each column order- 
ing. The dimensions of KJV-4grams are too similar for ordering to 
be interesting, and we have thus omitted them. For small dimen- 
sions, the value of k was lowered using the heuristic presented in 
Section 2. Our results suggest that table-column reordering has a 
significant effect (40%). This does not contradict the observation 
by Canahuate et. al [4] that bitmap reordering does not change the 
size much. 

The value of k affects which ordering leads to the smallest index: 
good orderings for k — 1 are frequently bad orderings for k > 1, 
and vice versa. This is consistent with our earlier analysis (see 
Figs. 3 and 4). For Netflix and DBGEN, we have omitted k = 2 for 
legibility: it is inferior to k = 1 for most orderings. 

Census-Income's largest dimension is very large {n\ ~ n/2); DB- 
GEN has also a large dimension (W4 ~ n/35). Sorting columns in 

decreasing order with respect to min(n ( . n i '^*)/(4w — 1)) 

for k = 1, we have that only for DBGEN the ordering "2134" is 
suggested, otherwise, "1234" is recommended. Thus the heuristic 
provides nearly optimal recommendations. For k = 3 and k = 4, 
the ordering "1234" is recommended for all data sets: for k — 4 
and Census-Income, this recommendation is wrong. For k = 2 
and Census-Income, the ordering "3214" is recommended, another 
wrong recommendation for this data set. Hence, a better column 
reordering heuristic is needed for k > 1. The difficulty appears 
to be fundamental: when we calculated the gain experimentally, 
we found that the best orderings sometimes did not have the di- 
mensions with highest gain first. Our greedy approach may be too 
simple, and it it may be necessary to know the histogram skews. 



6.4 Sorting 

On some synthetic Zipfian tests, we found a small improvement 
(less than 4% for 2 dimensions) by using Gray-Lex coding in pref- 
erence to Alpha-Lex [2, Fig. 3]. On other data sets, Gray-Lex either 
had no effect or a small positive effect. Therefore, our current ex- 
periments do not include Alpha-Lex, with the exception that we 
experimentally evaluated how sorting affects the EWAH compres- 
sion of individual columns. Whereas sorting tends to create runs 
of identical values in the first columns, the benefits of sorting are 
far less apparent in later columns, except those strongly correlated 
with the first few columns. For Table 3, we have sorted projections 
of Census-Income and DBGEN onto 10 dimensions d\ . ..d\Q with 
n\ < ... <niQ. (The dimensions d\...dt\'m this group are different 
from the dimensions d\...d\ discussed earlier.) We see that if we 
sort from the largest column {d\o . . - d\), at most 3 columns bene- 
fit from the sort, whereas 5 or more columns benefit when sorting 
from the smallest column (d\ . . . d\§). 

Lexicographic sorting. 

Constructing a simple bitmap index (using Gray-Lex) over KJV- 
4grams took approximately 14,000 seconds or less than four hours. 
Nearly half (6,000 s) of the time was due to the sort utility, since 
the data set is much larger than the machine's main memory (2 GiB). 
Constructing an unsorted index is faster (approximately 10,000 s), 
but the index is about 9 times larger. 

To study scaling, we built indexes from prefixes of the full data 
set. We found construction times increased linearly with index size 
for k = 1 , whether or not sorting was used. For 1 < k < 4, index 
size increased linearly with the prefix size for unsorted data. Yet 
with sorting, index size increased sublinearly. As new data arrives, 
it is increasingly likely to fit into existing runs, once sorted. 



Table 4: Sizes of EWAH indexes (32-bit words) for various sort- 
ing methods. 





LcA Liiisui icu. \jy dy-ijCA uidy-ricL^ 


Census-Income k = 1 
(4d) 2 
3 

A 


8.49 x 10 5 4.87 x 10 5 4.87 x 10 5 
9.12xl0 5 4.52 x 10 5 4.36 x 10 5 
6.90 x 10 s 3.73 x 10 5 3.28 x 10 5 
,1 co v 1 n5 t n v 1 1 qc v 1 a5 

4.JO X 1U Z.1/X1U 1 .70 X 1U 


DBGEN 1 
(4d) 2 
3 
4 


5.48 x 10 7 3.38 x 10 7 3.38 x 10 7 
7.13 x 10 7 2.76 x 10 7 2.74 x 10 7 
5.25 x 10 7 1.50 x 10 7 1.50 x 10 7 
^ ?4 v 1 n 7 1 9 1 v 1 n 7 1 1 q v 1 n 7 

J.it A 1U A 1U 1.17 A 1U 


Netflix 1 
2 
3 
4 


6.20 x10 s 3.22 x 10 s 3.19 x10 s 
8.27 x 10 s 3.17xl0 8 2.43 x 10 s 
5.73 x10 s 1.97 x10 s 1.49 x10 s 
3.42 x10 s 1.37 x10 s 1.14 x10 s 


KJV-4grams 1 
2 
3 
4 


6.08 x 10* 6.68 x 10 s 6.68 x 10 s 
8.02 x 10 9 9.93 x 10 s 7.29 x 10 s 
4.13 x 10 9 8.31 x 10 s 5.77 x10 s 
2.52 x 10 9 6.39 x 10 s 5.01x10 s 



Table 4 shows index sizes for our large data sets, using Gray-Lex 
orderings and Gray-Frequency. Dimensions were ordered from the 
largest to the smallest ("4321") except for Census-Income where 
we used the ordering "3214". We observed that KJV-4grams did 
not benefit in index size for k = 2. This data set has many very long 
runs of identical attribute values in the first two dimensions, and the 
number of attribute values is modest, compared with the number of 
rows. This is ideal for l-of-/V. 

Gray-Frequency yields the smallest indexes in Table 4. Frequent- 
Component is not shown in the table. On Netflix for k = 1 it outper- 
formed the other approaches by 1%, and for DBGEN it was only 
slightly worse than the others. But in all other case on DBGEN, 
Census-Income and Netflix, it lead to indexes 5-50% larger. 

6.5 Queries 

We timed equality queries against our 4-d bitmap indexes, and 
the results are shown in Fig. 6. Queries were generated by choosing 
attribute values uniformly at random and the figures report average 
times for such queries. We made 100 random choices per column 
for KJV-4grams when k > 1. For DBGEN and Netflix, we had 
1,000 random choices per column and 10,000 random choices were 
used for Census-Income and KJV-4grams (k = 1). For each data 
set, we give the results per column (leftmost tick is the column 
used as the primary sort key, next tick is for the secondary sort key, 
etc.). 

From Fig. 6(b), we see that simple bitmap indexes always yield 
the fastest queries. The difference caused by k is highly depen- 
dent upon the data set and the particular column in the data set. 
However, for a given data set and column, with only a few small 
exceptions, query times increase significantly with k. For DBGEN, 
the last two dimensions have size 7 and 11, whereas for Netflix, 
the last dimension has size 5, and therefore, they will never use a 
fe-value larger than 2: their speed is mostly oblivious to k. 

In Section 5, we predicted that the query time would grow with 
k as ~ (2 — l/k)n i '^: for the large dimensions such as the largest 
ones for DBGEN (400k) and Netflix (480k), query times are two 
orders of magnitude slower for k = 2 as opposed to k = 1, and four 
orders of magnitude slower for k = 4. Thus, our model exaggerates 
the differences by about an order of magnitude. The most plausible 
explanation is that query times are not directly proportional to the 
bitmap loaded, but also include a constant factor. 

Fig. 6(a) and 6(b) show the equality query times per column be- 
fore and after sorting the tables. Sorting improves query times most 
for larger values of k: for Netflix, sorting improved the query times 
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Figure 5: Index sizes (words, y axis) on 4-d data sets for all dimension orderings (x axis). 



Table 3: Number of 32-bit words used for different unary indexes when the table was sorted lexicographically (dimensions ordered 
by descending cardinality, d\p...d\,or by ascending cardinality, d\ . . . dip). 







Census-Income 






DBGEN 






cardinality 


unsorted 


di ...dw 


d l0 ...di 


cardinality 


unsorted 


d\ ...dio 


d w ...d\ 


di 


7 


42 427 


32 


42 309 


2 


0.75 xlO 6 


24 


0.75 xlO 6 


d 2 


8 


36 980 


200 


36 521 


3 


1.11 x 10 6 


38 


l.llxlO 6 


di 


10 


34 257 


1 215 


28 975 


7 


2.58 xlO 6 


150 


2.78 xlO 6 


dt\ 


47 


0.13xl0 6 


12 118 


0.13xl0 6 


9 


0.37 xlO 6 


100 6 


3.37xl0 6 


d 5 


51 


35 203 


17 789 


28 803 


11 


4.11xl0 6 


10 824 


4.11xl0 6 


d 6 


91 


0.27 xlO 6 


75 065 


0.25 xlO 6 


50 


13.60xl0 6 


0.44 xlO 6 


1.42xl0 6 


d 1 


113 


12 199 


9 217 


12 178 


2 526 


23.69 xlO 6 


22.41 xlO 6 


23.69xl0 6 


d% 


132 


20 028 


14 062 


19 917 


20 000 


24.00 xlO 6 


24.00 xlO 6 


22.12xl0 6 


d 9 


1 240 


29 223 


24 313 


28 673 


400 000 


24.84xl0 6 


24.84xl0 6 


19.14xl0 6 


d\o 


99 800 


0.50 xlO 6 


0.48 xlO 6 


0.30xl0 6 


984 297 


27.36xl0 6 


27.31 xlO 6 


0.88 xlO 6 


total 




l.llxl0 b 


0.64 xlO 6 


0.87xl0 b 




0.122xlO y 


0.099 xlO y 


0.079 x 10 y 



by at most 2 for k = 1, at most 40 for k = 2 and at most 140 for 
k = 3; indexes with k > 1 benefit from sorting even when there are 
no long runs of identical values (see Subsection 4.1). (On the first 
columns, k = 3 usually gets the best improvements from sorting.) 
Synthetic DBGEN showed no significant speedup from sorting, be- 
yond its large first column. Although Netflix, like DBGEN, has a 
many-valued column first, it shows a benefit from sorting even in its 
third column: in fact, the third column benefits more from sorting 
than the second column. The largest table, KJV-4grams, benefited 
most from the sort: while queries on the last column are twice as 
fast, the gain on the first two columns ranges from 20 times faster 
(k = 1) to almost 1500 times faster (k = 3). 

We can compare these times with the expected amount of data 
scanned per query. This is shown in Figure 7, and we observe 
reasonably close agreement between most query times and the ex- 
pected sizes of the bitmaps being scanned. Exceptions include the 
first dimension on KJV-4grams and some cases where the bitmaps 
are tiny. This discrepancy might be explained by the retrieval of 
the row IDs from the compressed bitmaps: long runs of lxl 1 clean 
words must be converted to many row IDs. 

7. GUIDELINES FOR K 

Our experiments indicate that simple (k = 1) bitmap encoding is 
preferable when storage space and index-creation time are less im- 
portant than fast equality queries. The storage and index-creation 
penalties are kept modest by table sorting and Algorithm 1 . 

Space requirements can be reduced by choosing k > 1 , although 
Tab. 4 shows that this approach has risks (see KJV-4grams). For k > 
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Figure 7: Bitmap data examined per equality query. 

1, we can gain additional index size reduction at the cost of longer 
index construction by using Gray-Frequency rather than Gray-Lex. 

If the total number of attribute values is small relative to the num- 
ber of rows, then we should first try the k = 1 index. Perhaps the 
data set resembles KJV-4grams. Besides yielding faster queries, 
the k = 1 index may be smaller. 

8. CONCLUSION AND FUTURE WORK 

We showed that while sorting improves bitmap indexes, we can 
improve them even more (30-40%) if we know the number of dis- 
tinct values in each column. For k-of-N encodings with k > 1, even 
further gains (10-30%) are possible using the frequency of each 
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Figure 6: Query times are affected by dimension, table sorting and k. 



value. Regarding future work, the accurate mathematical modelling 
of compressed bitmap indexes remains an open problem. 
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